1 - Exploratory Data Analysis: Trend, Seasonality and Autocorrelation

df<- read_csv("candy_production.csv")
## Rows: 548 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): IPG3113N
## date (1): observation_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(df)<-c("date", "production")
df$date<-as.Date(df$date, format="%Y-%m-%d")
CandyXTS<- xts(df[-1], df[[1]])
CandyTS <- ts(df$production, start=c(1972,1),end=c(2017,8), frequency=12 )
ts_plot(CandyXTS, title="Candy Production Time Series")
[Figure: Candy Production Time Series]

Trend

ts_decompose(CandyTS)
[Figure: Decomposition of additive time series - CandyTS, with Observed, Trend, Seasonal, and Random panels]

I can see that the trend of the series is fairly flat until roughly 1983 and then increases consistently until about 2006. After this, candy production declines somewhat, and this decline coincides with the 2008 recession, when it makes sense that output would decrease. The trend is clearly not linear.

I can also tell that there is a strong seasonal pattern. In the next section I will look at seasonality more in depth.
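The decomposition can also be quantified. A minimal sketch below computes the strength-of-trend and strength-of-seasonality measures from Hyndman & Athanasopoulos using base `decompose()`; the measure names and thresholds are assumptions, not part of the analysis above:

```r
dec <- decompose(CandyTS)

# Strength measures: values near 1 mean the component dominates the remainder
trend_strength <- max(0, 1 - var(dec$random, na.rm = TRUE) /
                           var(dec$trend + dec$random, na.rm = TRUE))
seasonal_strength <- max(0, 1 - var(dec$random, na.rm = TRUE) /
                              var(dec$seasonal + dec$random, na.rm = TRUE))
c(trend = trend_strength, seasonal = seasonal_strength)
```

Values close to 1 for both would confirm what the decomposition panels suggest visually.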

Seasonality

ts_seasonal(CandyTS, type = "all")
[Figure: Seasonality Plot - CandyTS, with panels By Frequency Cycle, By Frequency Unit, and box plots by frequency unit]

The first plot shows the full frequency cycle of the series, which for a monthly series is a full year. The plot’s color scale is set by the chronological order of the lines (dark colors for early years, bright colors for the latest years). From this I can see a yearly seasonal pattern, where candy production ramps up in October - December.

The second plot shows each frequency unit over time; in this case, each line represents one month of production over the years. It looks as though the seasonal pattern remains the same over time, but overall production has increased over the years.

The last plot is a box-plot representation of each frequency unit, which allows us to compare the distribution of each frequency unit. Similar to the plots above, I can see that production increases in the final months of the year.

Next I want to use a periodogram to quantitatively evaluate seasonality.

p <- periodogram(CandyTS)

max_freq <- p$freq[which.max(p$spec)]
seasonality <- 1/max_freq
seasonality
## [1] 576

There seem to be two major spikes. Looking at the tallest spike, the seasonality comes out to 576. Since the time series is monthly, 576 units (months) would be equivalent to 48 years. I also only have a total of 548 data points, so I do not believe 576 is capturing a real seasonal pattern; it is more likely an artifact of the long-run trend.

I will look at the second largest spike in the periodogram and see what it suggests.

# Identify the indices of the two largest spectral densities
largest_indices <- order(p$spec, decreasing = TRUE)[1:2]

# Retrieve the corresponding frequencies
largest_freqs <- p$freq[largest_indices]

# Select the second largest frequency
second_max_freq <- largest_freqs[2]

# Estimate the seasonality based on the second maximum frequency
seasonality_2 <- 1/second_max_freq

# Print the seasonality
seasonality_2
## [1] 12

For this spike, the seasonality is 12, which suggests a yearly pattern. This is consistent with both the decomposition and the seasonality plots.

Autocorrelation

Acf(CandyTS)

The ACF plot depicted above exhibits signs of non-stationarity: the autocorrelations decay slowly and remain well outside the dashed significance bounds across many lags. Furthermore, I know there is seasonality in this time series, indicating a degree of time-dependence and consequently establishing its non-stationary nature.
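A formal unit-root test could complement the visual inspection. The sketch below uses `adf.test()` from the `tseries` package, which is not loaded elsewhere in this document and is therefore an assumption:

```r
# Augmented Dickey-Fuller test: the null hypothesis is that the series has a
# unit root (i.e., is non-stationary). Requires the tseries package.
library(tseries)
adf.test(CandyTS)

# The forecast package can suggest how many regular and seasonal differences
# would be needed to stationarize the series.
ndiffs(CandyTS)
nsdiffs(CandyTS)
```

Differencing based on `ndiffs()`/`nsdiffs()` is also what `auto.arima()` does internally in section 4.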

2 - Simple models: Average, Naive, Seasonal Naive

train<-window(CandyTS, start=c(1972,1), end=c(2009,12))
test<-window(CandyTS, start=c(2010,1))

Average: forecast will be equal to the average of past data

m_mean<-meanf(train, h=92)
accuracy(m_mean, test)
##                         ME     RMSE       MAE       MPE      MAPE     MASE
## Training set -1.131747e-15 19.03110 16.087687 -4.017003 17.335216 3.195745
## Test set      4.974573e+00 12.11305  9.751367  3.689583  8.981927 1.937064
##                   ACF1 Theil's U
## Training set 0.8782726        NA
## Test set     0.7935696  1.670889

Naive: forecast will be equal to the last observation

m_naive<-naive(train, h=92)
accuracy(m_naive, test)
##                     ME     RMSE       MAE         MPE      MAPE     MASE
## Training set   0.06780  9.34429  6.592331  -0.3756556  6.679489 1.309536
## Test set     -11.74155 16.11966 13.450687 -12.4376238 13.805155 2.671917
##                   ACF1 Theil's U
## Training set 0.2603380        NA
## Test set     0.7935696  2.551853

Seasonal Naive: forecast will be equal to the last observation of same season

m_snaive<-snaive(train, h=92)
accuracy(m_snaive, test)
##                      ME      RMSE       MAE        MPE      MAPE     MASE
## Training set  0.4044957  6.650535  5.034096  0.1927988  5.360879 1.000000
## Test set     11.1960207 14.003395 12.094347 10.9108145 11.712190 2.402486
##                   ACF1 Theil's U
## Training set 0.7571979        NA
## Test set     0.6864750  2.028931

Plots

autoplot(train)+
autolayer(m_mean, series="Mean", PI=FALSE)+
autolayer(m_naive, series="Naive", PI=FALSE)+
autolayer(m_snaive, series="Seasonal Naive", PI=FALSE)+
xlab('Year')+ylab('Candy Production')+
ggtitle('Forecasts for Candy Production')+
guides(colour=guide_legend(title='Forecast'))

When evaluated against the test set, the mean method has the best RMSE and MAPE values of these three simple methods.

3 - Exponential Smoothing Models: SES, Holt, Holt Winters, ETS

Simple exponential: smoothing for level

m_ses<-ses(train, h=92)
accuracy(m_ses, test)
##                        ME      RMSE      MAE        MPE      MAPE     MASE
## Training set   0.08912664  9.345527  6.59965  -0.349842  6.690203 1.310990
## Test set     -11.74160291 16.119700 13.45072 -12.437675 13.805191 2.671924
##                   ACF1 Theil's U
## Training set 0.2564186        NA
## Test set     0.7935696  2.551859

Holt: smoothing for level and for trend

m_holt<-holt(train, h=92)
accuracy(m_holt, test)
##                        ME      RMSE       MAE        MPE      MAPE     MASE
## Training set   0.04709578  9.349222  6.602357  -0.392977  6.694943 1.311528
## Test set     -13.87178976 17.531921 14.975683 -14.450762 15.326538 2.974850
##                   ACF1 Theil's U
## Training set 0.2557387        NA
## Test set     0.7832662  2.763515

Holt Winters: smoothing for level, trend and seasonality

m_holtw<-hw(train, seasonal="additive", h=92)
accuracy(m_holtw, test)
##                      ME     RMSE      MAE         MPE     MAPE      MASE
## Training set 0.03661278 3.910577 2.978740 -0.05128058 3.102872 0.5917130
## Test set     0.18823019 5.550671 4.634408 -0.04621848 4.467333 0.9206039
##                   ACF1 Theil's U
## Training set 0.1383633        NA
## Test set     0.6976633 0.8265308

ETS: Allows other combinations of trend and seasonal components

Error, Trend, Seasonality
A= additive, M= multiplicative, N= none
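Rather than letting `ets()` search over all combinations, a specific one can be forced with the `model` argument. A sketch below; the choice of "MAM" is an arbitrary illustration, not the model selected by the search further down:

```r
# ETS(M,A,M): Multiplicative errors, Additive trend, Multiplicative seasonality
m_ets_mam <- forecast(ets(train, model = "MAM"), h = 92)
accuracy(m_ets_mam, test)
```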

m_ets<-forecast(ets(train), h=92)
summary(m_ets)
## 
## Forecast method: ETS(A,N,A)
## 
## Model Information:
## ETS(A,N,A) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.5772 
##     gamma = 0.2617 
## 
##   Initial states:
##     l = 79.9047 
##     s = 22.9395 27.0302 19.3452 -5.8776 -9.0329 -9.9271
##            -9.6387 -13.234 -15.4527 -10.5926 -3.6046 8.0452
## 
##   sigma:  3.8803
## 
##      AIC     AICc      BIC 
## 4044.229 4045.320 4106.066 
## 
## Error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.0875434 3.820267 2.914016 0.05221298 3.025859 0.5788559
##                   ACF1
## Training set 0.1020205
## 
## Forecasts:
##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 2010      105.65965 100.68685 110.63245  98.05441 113.2649
## Feb 2010      103.83701  98.09518 109.57884  95.05564 112.6184
## Mar 2010       98.53152  92.11214 104.95091  88.71392 108.3491
## Apr 2010       92.76025  85.72830  99.79221  82.00580 103.5147
## May 2010       92.80072  85.20544 100.39599  81.18474 104.4167
## Jun 2010       92.96545  84.84584 101.08507  80.54757 105.3833
## Jul 2010       92.98669  84.37460 101.59878  79.81564 106.1577
## Aug 2010      100.15760  91.07972 109.23548  86.27418 114.0410
## Sep 2010      110.00786 100.48694 119.52877  95.44688 124.5688
## Oct 2010      121.53457 111.59034 131.47880 106.32619 136.7430
## Nov 2010      118.34048 107.99024 128.69073 102.51115 134.1698
## Dec 2010      116.24445 105.08495 127.40394  99.17747 133.3114
## Jan 2011      105.65965  94.13688 117.18242  88.03710 123.2822
## Feb 2011      103.83701  91.96208 115.71195  85.67587 121.9982
## Mar 2011       98.53152  86.31457 110.74847  79.84730 117.2157
## Apr 2011       92.76025  80.21060 105.30990  73.56721 111.9533
## May 2011       92.80072  79.92696 105.67448  73.11200 112.4894
## Jun 2011       92.96545  79.77555 106.15536  72.79324 113.1377
## Jul 2011       92.98669  79.48804 106.48534  72.34229 113.6311
## Aug 2011      100.15760  86.35712 113.95808  79.05158 121.2636
## Sep 2011      110.00786  95.91200 124.10371  88.45010 131.5656
## Oct 2011      121.53457 107.14940 135.91974  99.53435 143.5348
## Nov 2011      118.34048 103.67171 133.00926  95.90653 140.7744
## Dec 2011      116.24445 100.99388 131.49501  92.92072 139.5682
## Jan 2012      105.65965  90.14129 121.17801  81.92636 129.3929
## Feb 2012      103.83701  88.05539 119.61863  79.70111 127.9729
## Mar 2012       98.53152  82.49097 114.57208  73.99961 123.0634
## Apr 2012       92.76025  76.46488 109.05563  67.83862 117.6819
## May 2012       92.80072  76.25444 109.34699  67.49538 118.1061
## Jun 2012       92.96545  76.17203 109.75888  67.28213 118.6488
## Jul 2012       92.98669  75.94970 110.02368  66.93087 119.0425
## Aug 2012      100.15760  82.88048 117.43472  73.73453 126.5807
## Sep 2012      110.00786  92.49390 127.52181  83.22257 136.7931
## Oct 2012      121.53457 103.78694 139.28221  94.39191 148.6772
## Nov 2012      118.34048 100.36221 136.31876  90.84508 145.8359
## Dec 2012      116.24445  97.78841 134.70048  88.01838 144.4705
## Jan 2013      105.65965  86.98172 124.33758  77.09422 134.2251
## Feb 2013      103.83701  84.93979 122.73423  74.93621 132.7378
## Mar 2013       98.53152  79.41753 117.64552  69.29919 127.7639
## Apr 2013       92.76025  73.43191 112.08859  63.20011 122.3204
## May 2013       92.80072  73.26038 112.34105  62.91636 122.6851
## Jun 2013       92.96545  73.21540 112.71550  62.76036 123.1705
## Jul 2013       92.98669  73.02913 112.94425  62.46423 123.5091
## Aug 2013      100.15760  79.99466 120.32054  69.32104 130.9942
## Sep 2013      110.00786  89.64161 130.37410  78.86037 141.1553
## Oct 2013      121.53457 100.96703 142.10212  90.07922 152.9899
## Nov 2013      118.34048  97.57359 139.10738  86.58026 150.1007
## Dec 2013      116.24445  95.06260 137.42629  83.84961 148.6393
## Jan 2014      105.65965  84.28419 127.03512  72.96870 138.3506
## Feb 2014      103.83701  82.26967 125.40436  70.85261 136.8214
## Mar 2014       98.53152  76.77399 120.28905  65.25625 131.8068
## Apr 2014       92.76025  70.81418 114.70632  59.19663 126.3239
## May 2014       92.80072  70.66771 114.93372  58.95120 126.6502
## Jun 2014       92.96545  70.64708 115.28383  58.83244 127.0985
## Jul 2014       92.98669  70.48447 115.48891  58.57252 127.4009
## Aug 2014      100.15760  77.47303 122.84217  65.46455 134.8507
## Sep 2014      110.00786  87.14239 132.87332  75.03814 144.9776
## Oct 2014      121.53457  98.48963 144.57951  86.29037 156.7788
## Nov 2014      118.34048  95.11745 141.56352  82.82392 153.8570
## Dec 2014      116.24445  92.64961 139.83928  80.15927 152.3296
## Jan 2015      105.65965  81.89085 129.42845  69.30841 142.0109
## Feb 2015      103.83701  79.89551 127.77852  67.22164 140.4524
## Mar 2015       98.53152  74.41854 122.64450  61.65391 135.4091
## Apr 2015       92.76025  68.47701 117.04349  55.62225 129.8983
## May 2015       92.80072  68.34841 117.25302  55.40414 130.1973
## Jun 2015       92.96545  68.34523 117.58568  55.31207 130.6188
## Jul 2015       92.98669  68.19969 117.77369  55.07825 130.8951
## Aug 2015      100.15760  75.20494 125.11026  61.99580 138.3194
## Sep 2015      110.00786  84.89063 135.12508  71.59438 148.4213
## Oct 2015      121.53457  96.25385 146.81529  82.87105 160.1981
## Nov 2015      118.34048  92.89732 143.78365  79.42852 157.2524
## Dec 2015      116.24445  90.46148 142.02741  76.81280 155.6761
## Jan 2016      105.65965  79.71738 131.60192  65.98437 145.3349
## Feb 2016      103.83701  77.73641 129.93761  63.91959 143.7544
## Mar 2016       98.53152  72.27355 124.78949  58.37342 138.6896
## Apr 2016       92.76025  66.34584 119.17466  52.36290 133.1576
## May 2016       92.80072  66.23079 119.37064  52.16552 133.4359
## Jun 2016       92.96545  66.24092 119.68999  52.09381 133.8371
## Jul 2016       92.98669  66.10843 119.86495  51.87995 134.0934
## Aug 2016      100.15760  73.12650 127.18870  58.81710 141.4981
## Sep 2016      110.00786  82.82477 137.19095  68.43491 151.5808
## Oct 2016      121.53457  94.20034 148.86880  79.73047 163.3387
## Nov 2016      118.34048  90.85594 145.82503  76.30650 160.3745
## Dec 2016      116.24445  88.44504 144.04385  73.72893 158.7600
## Jan 2017      105.65965  77.71244 133.60686  62.91808 148.4012
## Feb 2017      103.83701  75.74277 131.93126  60.87057 146.8035
## Mar 2017       98.53152  70.29101 126.77203  55.34139 141.7217
## Apr 2017       92.76025  64.37423 121.14628  49.34757 136.1729
## May 2017       92.80072  64.26992 121.33151  49.16663 136.4348
## Jun 2017       92.96545  64.29062 121.64029  49.11108 136.8198
## Jul 2017       92.98669  64.16854 121.80485  48.91313 137.0603
## Aug 2017      100.15760  71.19683 129.11836  55.86593 144.4493
accuracy(m_ets, test)
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.0875434 3.820267 2.914016 0.05221298 3.025859 0.5788559
## Test set     1.5358342 6.177580 4.964258 1.20188945 4.749965 0.9861270
##                   ACF1 Theil's U
## Training set 0.1020205        NA
## Test set     0.7363525 0.9024974
autoplot(train)+
autolayer(m_ses, series="Simple Exponential", PI=FALSE)+
autolayer(m_holt, series="Holt Method", PI=FALSE)+
autolayer(m_holtw, series="Holt_Winters", PI=FALSE)+
autolayer(m_ets, series="ETS", PI=FALSE)+
xlab('Month/Year')+ylab('Candy Production')+
ggtitle('Forecasts for Candy Production')+
guides(colour=guide_legend(title='Forecast'))

When evaluated against the test set, Holt-Winters has the best RMSE and MAPE values of these four exponential smoothing models.
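The exponential smoothing family also includes damped-trend variants, which often help over long forecast horizons like the 92-month one used here. A sketch of a damped Holt-Winters fit (not part of the comparison above):

```r
# Holt-Winters with a damped trend: the trend flattens out over the horizon
# instead of continuing indefinitely
m_holtw_damped <- hw(train, seasonal = "additive", damped = TRUE, h = 92)
accuracy(m_holtw_damped, test)
```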

4 - ARIMA

autoarima <- auto.arima(train, allowdrift=F)
autoarima
## Series: train 
## ARIMA(4,1,1)(2,1,2)[12] 
## 
## Coefficients:
##          ar1     ar2     ar3      ar4      ma1     sar1    sar2     sma1
##       0.6373  0.0778  0.2031  -0.0884  -0.9719  -0.4630  0.0364  -0.1129
## s.e.  0.0519  0.0572  0.0580   0.0499   0.0226   0.4073  0.1150   0.4037
##          sma2
##       -0.4569
## s.e.   0.3127
## 
## sigma^2 = 13.59:  log likelihood = -1206.69
## AIC=2433.38   AICc=2433.89   BIC=2474.32
arima<-Arima(train, order=c(4,1,1),
           seasonal=list(order=c(2,1,2), period=12))
arima
## Series: train 
## ARIMA(4,1,1)(2,1,2)[12] 
## 
## Coefficients:
##          ar1     ar2     ar3      ar4      ma1     sar1    sar2     sma1
##       0.6373  0.0778  0.2031  -0.0884  -0.9719  -0.4630  0.0364  -0.1129
## s.e.  0.0519  0.0572  0.0580   0.0499   0.0226   0.4073  0.1150   0.4037
##          sma2
##       -0.4569
## s.e.   0.3127
## 
## sigma^2 = 13.59:  log likelihood = -1206.69
## AIC=2433.38   AICc=2433.89   BIC=2474.32
arima_f<-forecast(arima, h=92)
checkresiduals(arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,1)(2,1,2)[12]
## Q* = 32.667, df = 15, p-value = 0.005218
## 
## Model df: 9.   Total lags used: 24

Since the p-value of the Ljung-Box test is 0.005218, we have sufficient evidence at the 5% level to reject the null hypothesis of independently distributed residuals and conclude that some autocorrelation remains in the residuals.
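Since some residual autocorrelation remains, it may be worth comparing nearby model orders by AICc. A sketch below; the candidate orders are arbitrary illustrations, not models fitted elsewhere in this document:

```r
# Compare a few non-seasonal orders while keeping the seasonal part fixed
candidates <- list(c(2, 1, 1), c(3, 1, 1), c(4, 1, 2))
aiccs <- sapply(candidates, function(ord) {
  fit <- Arima(train, order = ord,
               seasonal = list(order = c(2, 1, 2), period = 12))
  fit$aicc
})
names(aiccs) <- sapply(candidates, paste, collapse = ",")
aiccs  # lower AICc is better
```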

arima_resid <- residuals(arima)
Acf(arima_resid)

p_resid <- periodogram(arima_resid)

The values in the ACF plot are mostly within the dashed significance bounds, and the periodogram shows power spread across all frequencies rather than concentrated in spikes, so the residuals look mostly like white noise.

accuracy(arima_f, test)
##                     ME      RMSE       MAE        MPE      MAPE      MASE
## Training set -0.129378  3.597095  2.701688 -0.1833357  2.778181 0.5366779
## Test set     12.415226 15.640364 12.770918 11.6386579 12.011212 2.5368841
##                     ACF1 Theil's U
## Training set 0.005813222        NA
## Test set     0.866520116  2.235618
autoplot(train)+
autolayer(arima_f, series="ARIMA", PI=FALSE)+
xlab('Month/Year')+ylab('Candy Production')+
ggtitle('Forecasts for Candy Production')+
guides(colour=guide_legend(title='Forecast'))

Model Comparisons

library(tibble)

# Create forecasts for each model
forecasts <- list(
  Mean = m_mean,
  Naive = m_naive,
  Seasonal_Naive = m_snaive,
  ETS = m_ets,
  Holt = m_holt,
  Holt_Winters = m_holtw,
  Simple_Exponential = m_ses,
  ARIMA = arima_f
)

# Calculate RMSE for each forecast
rmse_values <- sapply(forecasts, function(forecast) {
  accuracy_values <- accuracy(forecast, test)
  rmse <- accuracy_values[2, "RMSE"]
  return(rmse)
})

# Create a table of RMSE values
model_names <- names(forecasts)
rmse_table <- tibble(Model = model_names, RMSE = rmse_values)

# Sort the table by RMSE in ascending order
rmse_table <- rmse_table[order(rmse_table$RMSE), ]

# Calculate MAPE for each forecast
mape_values <- sapply(forecasts, function(forecast) {
  accuracy_values <- accuracy(forecast, test)
  mape <- accuracy_values[2, "MAPE"]
  return(mape)
})

# Create a table of MAPE values
model_names <- names(forecasts)
mape_table <- tibble(Model = model_names, MAPE = mape_values)

# Sort the table by MAPE in ascending order
mape_table <- mape_table[order(mape_table$MAPE), ]

# Combine RMSE and MAPE tables
combined_table <- left_join(rmse_table, mape_table, by = "Model")

# Format the combined table using kable
kable(combined_table, align = c("l", "r", "r"), caption = "RMSE and MAPE values for each forecast model")
RMSE and MAPE values for each forecast model

Model                  RMSE       MAPE
Holt_Winters        5.55067   4.467333
ETS                 6.17758   4.749965
Mean               12.11305   8.981927
Seasonal_Naive     14.00339  11.712190
ARIMA              15.64036  12.011212
Naive              16.11966  13.805155
Simple_Exponential 16.11970  13.805191
Holt               17.53192  15.326538

Overall, the best RMSE and MAPE on the test set were found with the Holt-Winters method, followed by ETS; the seasonal ARIMA model did not generalize as well despite its strong in-sample fit. For future analysis, it may be a good option to work with ensemble methods that combine the predictions of multiple models.
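A simple starting point for such an ensemble is to average the point forecasts of the two best-performing models, using the Holt-Winters and ETS fits from above:

```r
# Equal-weight average of the two best point forecasts over the test horizon
ensemble <- (m_holtw$mean + m_ets$mean) / 2
accuracy(ensemble, test)
```

More sophisticated weighting (e.g., inversely proportional to validation RMSE) would be a natural next step.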